Visual Transfer Learning: Informal Introduction 
and Literature Overview 
Abstract

Transfer learning techniques are important to handle small training 
sets and to allow for quick generalization even from only a few examples. 
The following paper is the introduction as well as the literature overview 
part of my thesis related to the topic of transfer learning for visual recog- 
nition problems. 



1 Motivation 

As humans we are able to visually recognize and name a large variety of object
categories. A rough estimate by Biederman (1987) suggests that we know
approximately 30,000 different visual categories, which corresponds to learning
five categories per day, on average, in our childhood. Moreover, we are able
to learn the appearance of a new category using few visual examples (Parikh
and Zitnick 2010). Despite the impressive success of current machine vision
systems (Everingham et al. 2010), their performance is still far from being
comparable to human generalization abilities. Current machine learning methods,
especially when applied to visual recognition problems, often need several
hundred or thousand training examples to build an appropriate classification
model or classifier. Transfer learning techniques try to reduce this remaining
gap between human and machine vision.

The importance of efficient learning with few examples can be illustrated
by analyzing current large-scale datasets for object recognition, such as
LabelMe (Russell et al. 2008; Torralba et al. 2010). Figure 1(a) shows the
relative number of object categories in LabelMe that possess a specific number
of labeled instances. A large percentage (over 60%) of all categories have only
a single labeled instance. Therefore, even in datasets which include an
enormous number of images and annotations in total, the lack of training data
is a more common problem than one might expect. The plot also shows that the
number of object categories with k labeled instances follows a power-law
function β · k^(−α), which is additionally illustrated in the right log-log
plot of Figure 1. This phenomenon is known as Zipf's law (Zipf 1949) and can
be found in language statistics and other scientific data.
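
A function of the form β · k^(−α) appears as a straight line in a log-log plot,
which suggests a simple way to estimate its two parameters. The following
Python sketch fits β and α by linear least squares in log space. The counts are
synthetic values generated for this illustration, not LabelMe statistics, and
the helper name fit_power_law is hypothetical.

```python
import numpy as np

def fit_power_law(k, counts):
    """Fit counts ~ beta * k**(-alpha) by least squares in log-log space."""
    log_k = np.log(np.asarray(k, dtype=float))
    log_c = np.log(np.asarray(counts, dtype=float))
    # log(counts) = log(beta) - alpha * log(k) is a straight line in log_k
    slope, intercept = np.polyfit(log_k, log_c, deg=1)
    return np.exp(intercept), -slope

# Synthetic data (assumption for illustration): an exact power law
k = np.arange(1, 51)
counts = 1000.0 * k ** (-1.5)

beta, alpha = fit_power_law(k, counts)
print(f"beta = {beta:.1f}, alpha = {alpha:.2f}")
```

On real category statistics the fit would only be approximate, and the text
above notes that the law holds mainly in the first part of the plot.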

With current state-of-the-art approaches we are able to build robust object
detectors for tasks with a large set of available training images, like
detecting pedestrians or cars (Felzenszwalb et al. 2008). However, if we want
to extract a richer semantic representation of an image, such as trying to
predict different visual attributes of a car (model type, specific identity,
etc.), we are unlikely to have a sufficient number of images for each new
category. Therefore, high-level visual recognition approaches frequently suffer
from weak training representations.

Figure 1: (a) Number of object categories in the LabelMe dataset for a specific
number of labeled instances (inspired by Wang et al. (2010)); (b) Extended
plot in logarithmic scale illustrating Zipf's law in the first part of the plot
(Zipf 1949). Similar statistics can be derived for ImageNet (Deng et al. 2009)
and TinyImages (Torralba et al. 2008). The LabelMe database was obtained in
April 2008.



1.1 Industrial Applications 

Problems related to a lack of learning examples are not restricted to visual
object recognition tasks in real-world scenes, but are also prevalent in many
industrial applications. Collecting more training data is often expensive,
time-consuming or practically impossible. In the following, we give three
example tasks in which such a problem arises.

One important example is face identification (Tan et al. 2006), where the
goal is to estimate the identity of a person from a given image. For example,
such a system has to be trained with images of each person who is allowed to
access a protected security area. Obtaining hundreds of training images for
each person is thus impractical, especially because the appearance should vary
and include different illumination settings and clothing, which leads to a
time-consuming acquisition process.

Another interesting application scenario is the prediction of user preferences
in shop systems. The goal is to estimate the probability that a client likes a
new product, given some previous product selections and ratings. If a machine
learning system quickly generalizes from a few given user ratings and achieves
a high performance in suggesting good products to buy, it is more probable
that the client will use this shop frequently. In this application area, solving
the problem of learning with few training examples is simply a question of cost.
The economic importance of the problem can be seen in the Netflix prize (Bennett
and Lanning 2007), which promised one million dollars for a new algorithm
which improves the rating accuracy of a DVD rental system. This competition
has led to a large amount of machine learning research related to collaborative
filtering, which is a special case of knowledge transfer and is explained in
more detail in Section 2.

Figure 2: Images of two object categories (fire truck and okapi) from the
Caltech 256 database (Griffin et al. 2007).

A prominent and widely established field of application for machine learning
and computer vision is automatic visual inspection (AVI) (Chin and Harlow
1982). To achieve a high quality of industrial production, several work pieces
have to be checked for errors or defects. Due to the required speed and the
high cost of manual quality control, the need for automatic visual defect
localization arises. Whereas images of non-defective work pieces can be easily
obtained in large numbers, training images for all kinds of defects are often
impossible to collect. One solution is to treat the task as an outlier
detection or one-class classification problem. In this case, the learning data
consists only of non-defective examples and is used afterward to detect
examples that do not belong to the underlying distribution of the data.
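
The one-class idea can be sketched in a few lines of Python. The example
assumes, purely for illustration, that non-defective examples are described by
feature vectors following a single Gaussian distribution; the text does not
prescribe a particular model. Examples whose squared Mahalanobis distance to
the training mean exceeds a quantile of the training distances are flagged as
defects. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def mahalanobis_sq(X, mu, prec):
    """Squared Mahalanobis distance of each row of X to the mean mu."""
    diff = X - mu
    return np.einsum("ij,jk,ik->i", diff, prec, diff)

def fit_one_class(X_normal, quantile=0.99):
    """Fit a Gaussian to non-defective data and choose a distance threshold."""
    mu = X_normal.mean(axis=0)
    # small ridge term keeps the covariance estimate invertible
    cov = np.cov(X_normal, rowvar=False) + 1e-6 * np.eye(X_normal.shape[1])
    prec = np.linalg.inv(cov)
    threshold = np.quantile(mahalanobis_sq(X_normal, mu, prec), quantile)
    return mu, prec, threshold

def is_defect(X, model):
    """Flag examples that lie outside the learned distribution."""
    mu, prec, threshold = model
    return mahalanobis_sq(X, mu, prec) > threshold

# Synthetic non-defective features (assumption for illustration)
rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 2))
model = fit_one_class(X_normal)

X_test = np.array([[0.1, -0.2],   # close to the training data
                   [8.0, 8.0]])   # far away from the training data
flags = is_defect(X_test, model)
print(flags)
```

No defective example is needed for training; the threshold is derived from the
non-defective data alone.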

1.2 Challenges 

What are the challenges and the problems of traditional machine learning
methods in scenarios with few training examples? First of all, we have to
clarify our notion of "few". Common to all traditional machine learning methods
are their underlying assumptions, which were formulated by Niemann (1990). The
first postulate states:

Postulate 1: "In order to gather information about a task domain,
a representative sample of patterns is available." (Niemann 1990,
Section 1.3, p. 9)

Therefore, a scenario with few training examples can be defined as a classi- 
fication task that violates this assumption by having an insufficient or non- 
representative sample of patterns. Of course, this notion depends on the specific 
application and on the complexity of the task under consideration. 

One of the difficulties is the high variability in the data of high-level visual
learning tasks. Some images from an object category database are given in
Figure 2. A classification system has to cope with background clutter, different
viewpoints, illumination changes and in general with a large diversity of the
category (intra-class variance). On the one hand, this can only be achieved
with a large amount of flexibility in the underlying model of the category, such
as using a large set of features extracted from the images and a complex
classification model. On the other hand, learning those models requires a large
number of (representative) training examples. These conditions turn learning
with few examples into a severe problem, especially for high-level recognition
tasks.

The trade-off between a highly flexible model and the number of training
examples required can be explained quite intuitively for polynomial regression.
Consider a set of n sample points of a one-dimensional polynomial of order m.
The order of the polynomial is a measure of the complexity of the function. For
the noise-free case, we need n ≥ m + 1 examples to get an exact and unique
solution. In contrast, noisy input data requires a higher number of examples
to estimate a good approximation of the underlying polynomial. This direct
dependency on the model complexity becomes more severe if we increase the
input dimension D. The number of coefficients that have to be estimated, and
analogously the number of examples required, grows polynomially in D according
to O(D^m) (Bishop 2006, Exercise 1.16). This immense increase in the amount of
required training data is known in a broader sense as the curse of
dimensionality.
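
To make the growth of the number of coefficients concrete, the following
Python sketch counts the monomials of total degree at most m in D variables by
explicit enumeration and compares the result with the closed form
C(D + m, m), which behaves like O(D^m) for fixed m. The helper name
count_coefficients is hypothetical and only serves this illustration.

```python
from itertools import combinations_with_replacement
from math import comb

def count_coefficients(D, m):
    """Count monomials of total degree <= m in D variables by enumeration."""
    total = 0
    for degree in range(m + 1):
        # every multiset of `degree` variable indices defines one monomial
        total += sum(1 for _ in combinations_with_replacement(range(D), degree))
    return total

m = 3
for D in (1, 2, 10, 100):
    n_coeff = count_coefficients(D, m)
    # the enumeration agrees with the binomial coefficient C(D + m, m)
    assert n_coeff == comb(D + m, m)
    print(f"D = {D:3d}: {n_coeff} coefficients")
```

Already for a cubic polynomial, moving from D = 10 to D = 100 input dimensions
raises the number of coefficients from 286 to 176851.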

A deeper insight and an analysis for classification rather than regression
tasks is offered by the theoretical bounds derived in statistical learning
theory for the error of a classification model or hypothesis g (Cucker and
Smale 2002; Cristianini and Shawe-Taylor 2000). Assume that a learner selects a
hypothesis from a possibly infinite set of hypotheses H which achieves zero
error on a given sampled training set of size n. The attribute "sampled" refers
to the assumption that the training set is a sample from the underlying joint
distribution of examples and labels of the task. Due to this premise the
following bound is not valid for one-class classification. A theorem proved by
Shawe-Taylor et al. (1993, Corollary 3.4) states that with a probability of
1 − δ the following number of training examples is sufficient for achieving an
error below ε:



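    n \geq \frac{1}{\epsilon (1 - \sqrt{\epsilon})} \left( 2 \ln\left( \frac{6}{\epsilon} \right) \dim(H) + \ln\left( \frac{2}{\delta} \right) \right)    (1)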



The term dim(H) denotes the VC dimension¹ of H and can be regarded as a
complexity measure of the set of available models or hypotheses. For example,
the class of all D-dimensional linear classifiers including a bias term has a VC
dimension of D + 1 (Vapnik 2000, Section 4.11, Example 2). Let us now take
a closer look at the bound in Eq. (1). If we fix the maximum error ε and
choose an appropriately small value for δ, we can see that the sufficient number
of training examples depends linearly on the VC dimension dim(H). This directly
corresponds to our previous example of polynomial regression, because the VC
dimension of D-variate polynomials of up to order m is exactly the binomial
coefficient C(D + m, m) (Ben-David and Lindenbaum 1998).
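
The linear dependence on the VC dimension can be explored numerically. The
sketch below assumes that the bound has the form
n ≥ 1/(ε(1 − √ε)) · (2 ln(6/ε) dim(H) + ln(2/δ)); this form is an assumption of
the example and should be checked against Shawe-Taylor et al. (1993) before
relying on the numbers. The function name sufficient_samples is hypothetical.

```python
import math

def sufficient_samples(vc_dim, eps, delta):
    """Sufficient training set size under the assumed form of the bound:
    n >= 1/(eps*(1-sqrt(eps))) * (2*ln(6/eps)*vc_dim + ln(2/delta))."""
    factor = 1.0 / (eps * (1.0 - math.sqrt(eps)))
    inner = 2.0 * math.log(6.0 / eps) * vc_dim + math.log(2.0 / delta)
    return math.ceil(factor * inner)

eps, delta = 0.1, 0.05
for D in (10, 100, 1000):
    # linear classifier with bias term: VC dimension D + 1
    n = sufficient_samples(D + 1, eps, delta)
    print(f"D = {D:4d}: n >= {n}")
```

For fixed ε and δ the result grows linearly with the VC dimension, so reducing
the input dimension of a linear classifier reduces the sufficient number of
examples proportionally.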



1.3 The Importance of Prior Knowledge 



Nearly all machine learning algorithms can be formulated as optimization
problems, whether in a direct way, as done by support vector machines (SVM)
(Cristianini and Shawe-Taylor 2000), or in an indirect manner, as in boosting
approaches (Friedman et al. 2000). From this point of view, we can say that
learning with few examples inherently tries to solve an ill-posed optimization
problem. Therefore, it is not possible to find a suitable well-defined solution
without incorporating additional (prior) information. The role of prior
information is to (indirectly) reduce the set of possible hypotheses. For
example, if we know in advance that for a classification task only the color of
an object is important, e.g., if we want to detect expired meat, only a small
number of features have to be computed and a lower dimensional linear
classifier can be used. In this situation the VC dimension is reduced, which
lowers the bound on the sufficient number of training examples (cf. Eq. (1)).

1 VC is an abbreviation for Vapnik-Chervonenkis.

Introducing common prior knowledge into the machine learning part of a
visual recognition system is often done by regularization techniques that
penalize non-smooth solutions (Schölkopf and Smola 2001) or the "complexity" of
the classification model. Examples of such techniques are L2-regularization,
also known as Tikhonov regularization (Vapnik 2000), or L1-regularization,
which is mostly related to methods trying to find a sparse solution (Seeger
2008).
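
The effect of such a penalty can be illustrated with a least-squares problem
that has fewer examples than parameters and is therefore ill-posed without
additional prior information. The following Python sketch solves the
L2-regularized problem in closed form for two regularization strengths. The
data are synthetic and the helper name ridge_fit is hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer examples than parameters
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [1.0, -1.0]          # only two features are relevant
y = X @ w_true + 0.01 * rng.normal(size=n)

w_weak = ridge_fit(X, y, lam=1e-3)     # weak prior
w_strong = ridge_fit(X, y, lam=10.0)   # strong prior towards small weights
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

A larger regularization weight restricts the solution to a smaller ball of
weight vectors, which is one way of reducing the effective set of hypotheses.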

Other possibilities to incorporate prior knowledge include semi-supervised
learning and transductive learning, which utilize large sets of unlabeled data
to support the learning process (Fergus et al. 2009). Unlabeled data can help
to estimate the underlying data manifold and can therefore reduce the
number of model parameters. However, the use of unlabeled data is not studied
in this thesis.



2 Knowledge Transfer and Transfer Learning 

Machine learning tasks related to computer vision always require a large amount 
of prior knowledge. For a specific task, we indirectly incorporate prior knowl- 
edge into the classification system by choosing image preprocessing steps, fea- 
ture types, feature reduction techniques, or the classifier model. This choice 
is mostly based on prior knowledge manually obtained by a software developer 
from previous experiences on similar visual classification tasks. For instance, 
when developing an automatic license plate reader, ideas can be borrowed from 
optical text recognition or traffic sign detection. Increasing expert prior
knowledge decreases the number of training examples needed by the
classification system to perform a learning task with a sufficiently low error
rate. However, this requires a large manual effort.

The goal of some techniques presented in this work is to perform transfer
of prior knowledge from previously learned tasks to a new classification task
in an automated manner, which is known as transfer learning and which is
a special case of knowledge transfer. The advantage compared to traditional
machine learning methods, or independent learning, is that we do not have
to build new classification systems from scratch or with large manual effort.
Previously known tasks used to obtain prior knowledge are referred to as support
tasks, and a new classification task equipped with only a few training examples
is called the target task. In the following, we concentrate on inductive
transfer learning (Pan and Yang 2010), which assumes that we have labeled data
for the target and the support tasks. Especially interesting are situations
where a large number of training examples for the support tasks exists and
prior knowledge can be robustly estimated. Other terms for transfer learning
are learning to learn (Thrun and Pratt 1997), lifelong learning and interclass
transfer (Fei-Fei et al. 2006).

Figure 3: The basic idea of transfer learning for visual object recognition: a lot
of visual categories share visual properties which can be exploited to learn a
new object category from few training examples.



The concept of transfer learning is also one of the main principles that explain
the development of the human perception and recognition system (Brown and
Kane 1988). For example, it is much easier to learn Spanish if we are already
able to understand French or Italian. The knowledge transfer concept is known
in the language domain as language transfer or linguistic interference (Odlin
1989). We already mentioned that a child quickly learns new visual object
categories in an incremental manner without using many learning examples.
Figure 3 shows some images of a transfer learning scenario for visual object
recognition. Generalization from a single example of the new animal category,
on the right-hand side of Figure 3, is possible due to a large set of available
memories (images) of related animal categories. These categories often share
visual properties, such as typical object part constellations (head, body and four
legs) or similar fur texture.

Developing transfer learning techniques and ideas requires answering four
different questions: "What, how, from where and when to transfer?". First
of all, the type of knowledge which will be transferred from support tasks to
a new target task has to be defined, e.g., information about common suitable
features. Detailed examples are listed in the paragraphs of Section 3. The
transfer technique applied to incorporate prior knowledge into the learning
process of the target task strongly depends on this definition but is not
determined by it. For example, the relevance of features for a classification
task can be transferred using generalized linear models (Lee et al. 2007) or
random decision forests.
Prior knowledge is only helpful for a target task if the support tasks are somehow 
related or similar. In some applications not all available previous tasks can be 
used as support tasks, because they would violate this assumption. Giving an 
answer to the question "From where to transfer?" means that the learning 
algorithm has to select suitable support tasks from a large set of tasks. These 
learning situations are referred to as learning in heterogeneous environments.
Of course, we expect that additional information incorporated by transfer
learning always improves the recognition performance on a new target task,
because this is the working hypothesis of transfer learning in general.
However, in machine learning there is no guarantee at all that the model learned
from a training set is also appropriate for all unseen examples of the task,
i.e. there are no deterministic warranties concerning the generalization
ability of a learner². Therefore, knowledge transfer can fail and lead to worse
performance compared to independent learning. This event is known as negative
transfer and also happens in everyday life, for example through "false
friends" when learning a new language: German speakers are sometimes confused
about "getting a gift" because the word "Gift" is the German word for poison,
which is likely not a thing you are happy to get. Situations in which negative
transfer might occur are difficult to detect.

Figure 4: The concept of (a) traditional machine learning with independent
tasks, (b) transfer learning, (c) multitask learning and (d) multi-class transfer
learning (inspired by Pan and Yang (2010)).

Besides transfer learning, another type of knowledge transfer is multitask
learning, which learns different classification tasks jointly³. Combined
estimation of model parameters can be very helpful, especially if a set of
tasks is given, each having only a small number of training examples. In
contrast to transfer learning, no prior knowledge is obtained in advance;
instead, the model parameters of each task are coupled together. For example,
given a set of classification tasks, relevant features can be estimated jointly
and all classifiers are then learned independently with the reduced set of
features. A possible application is collaborative filtering as mentioned in
Section 1. In this thesis, we stick to transfer learning but borrow some
ideas from multitask learning approaches.

Figure 4 summarizes the conceptual difference between independent learning,
transfer learning, and multitask learning. Furthermore, we illustrate the
principle of multi-class transfer learning. In contrast to all other
approaches, transfer within a single multi-class task is considered rather than
between several (binary) classification tasks. To emphasize this fact, we use
the term target class rather than target task. The main difficulty is that the
target class has to be distinguished from the support class, even though the
transferred information exploits their similarity.

2 Note that even the bound in Eq. (1) only holds with probability 1 − δ.
3 Xue et al. (2007) use the term symmetric multitask learning to refer to
jointly learning tasks, whereas asymmetric multitask learning refers to our
use of the term transfer learning.

It should be noted that another area of knowledge transfer is transductive
transfer learning, which concentrates on transferring knowledge from one
application domain to another. For example, the goal is to recognize objects in
low quality webcam images with the support of labeled data from photos taken
with digital cameras. Related terms are sample-selection bias, covariate shift,
and domain adaptation (Pan and Yang 2010).



3 Literature Overview: Transfer Learning 

There is a large body of literature trying to handle the problem of learning
with few examples. A lot of work concentrates on new feature extraction methods
or classifiers which show superior performance to traditional methods,
especially for few training examples (Levi and Weiss 2004). The few-examples
problem was also tackled by introducing manual prior knowledge of the
application domain, such as augmenting the training data by applying artificial
transformations (Duda et al. 2000; Bayoudh et al. 2007), also known as data
manufacturing.

In the following, we concentrate on methods related to the principle of
transfer learning and multitask learning as introduced in the previous section.
Similar surveys and literature reviews can be found in the work of Fei-Fei
(2006) from a computer vision perspective and the journal paper of Pan and Yang
(2010), which gives a comprehensive overview of the work done in the machine
learning community. There is also a textbook by Brazdil et al. (2009) covering
the broader area of meta-learning, and the edited book of Thrun and Pratt
(1997) about the early developments of learning to learn.



3.1 What and how to transfer? 

The type of knowledge which is transferred from support tasks to a target task 
is often directly coupled with the method used to incorporate this additional 
information. Therefore, we give a combined literature review on answers to both 
questions. 



Learning Transformations One of the most intuitive types of knowledge 
which can be transferred between categories is application-specific transforma- 
tions or distortions. While in data manufacturing methods these transforma- 
tions have to be defined using manual supervision, transfer learning methods 
learn this information from support tasks. 

For example, a face recognition application can significantly benefit from
transformations transferred from other persons using optical flow techniques
(Beymer and Poggio 1995). Estimating latent transformations and distortions
of images (e.g. illumination changes, rotations, translations) within a category
is proposed by Miller et al. (2000) and Learned-Miller (2006) using an
optimization approach. Their approach, called Congealing, tries to minimize the
joint entropy of gray-value histograms in each pixel with a greedy strategy. The
obtained transformations can be directly applied to the images of a target task
and used in a nearest neighbor approach for text recognition. Restricting and
regularizing the complexity of the class of transformations during estimation is
important for good generalization, because it additionally introduces generic
prior knowledge. The original Congealing approach proposed a heuristic
normalization step directly applied during optimization. An extension without
explicit normalization, and a study of different complexity measures, can be
found in the work of Vedaldi et al. (2008) and Vedaldi and Soatto (2007). Huang
et al. (2007) use Congealing to align images of cars or faces with local
features.



Shared Kernel or Similarity Measure How we compare objects and im- 
ages strictly depends on the current task. A distance metric or a more general 
similarity measure can be an important part of a classification system, e.g. in 
nearest neighbor approaches. The term kernel is a related concept which also 
measures the similarity between input patterns. Hence, a distance metric or a 
kernel function is an important piece of prior knowledge which can be transferred 
to new tasks. 



The early work of Thrun (1996) used neural network techniques to estimate
a similarity measure for a specific task. In general, the idea of estimating an
appropriate metric from data is a research topic of its own, called metric
learning (Yang and Jin 2006). A common idea is to find a metric which minimizes
distances between examples of a single category (intra-class) and maximizes
distances between different categories (inter-class). Applying a similarity
measure to another task is straightforward when using a nearest neighbor
classifier. Fink (2004) used the metric learning algorithm of Shalev-Shwartz et
al. (2004), which allows online learning and estimates the correlation matrix of
a Mahalanobis distance.
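
How a transferred metric changes a nearest neighbor decision can be sketched as
follows. The matrix M below is a hand-chosen stand-in for a Mahalanobis matrix
that would be learned on support tasks; it is an assumption of this example and
not the output of any of the cited algorithms. The target task provides a
single training example per class.

```python
import numpy as np

def mahalanobis_nn(x, X_train, y_train, M):
    """1-nearest-neighbor label under the distance (x - z)^T M (x - z)."""
    diff = X_train - x
    d2 = np.einsum("ij,jk,ik->i", diff, M, diff)
    return y_train[np.argmin(d2)]

# Target task with one training example per class
X_train = np.array([[0.0, 5.0],
                    [1.0, -5.0]])
y_train = np.array([0, 1])
x = np.array([0.9, 4.0])

# Hypothetical transferred metric: the second feature is irrelevant
M_transferred = np.diag([1.0, 0.0])
M_euclidean = np.eye(2)

label_transfer = mahalanobis_nn(x, X_train, y_train, M_transferred)
label_euclid = mahalanobis_nn(x, X_train, y_train, M_euclidean)
print(label_transfer, label_euclid)
```

With the Euclidean metric the irrelevant second feature dominates the decision,
whereas the transferred metric ignores it; whether this helps depends on how
well the support tasks match the target task.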

Metric learning for visual identification tasks is presented by Ferencz et al.
(2008). They show how to find discriminative local features which can be used
to compare different instances of an object category, e.g., distinguishing
between specific instances of a car. In this work, metric learning is based on
learning a binary classifier which estimates the probability that two images
belong to the same object instance. The obtained similarity measure can be used
for visual identification with only a single training example for each instance.
Another application of metric learning is domain adaptation as explained in
Section 2. The paper of Saenko et al. (2010) presents a new database for testing
domain adaptation methods and also gives results of a metric learning algorithm.
In contrast, Jain and Learned-Miller (2011) propose to perform domain adaptation
by applying Gaussian process regression on scores of examples near the decision
boundary.



Shared Features Visual appearance can be described in terms of features 
such as color, shape and local parts. Thus, it is natural to transfer information 
about the relevance of features for a given task. This idea can be generalized 
to shared base classifiers which allow modeling feature relevance. Fink et al.
(2006) study combining perceptron-like classifiers of the support and target
task. Due to the decomposition into multiple base classifiers (weak learners),
ensemble methods and especially Boosting approaches (Freund and Schapire
1997) are suitable for this type of knowledge transfer. Levi et al. (2004) extend
the standard Boosting framework by integrating task-level error terms. Weak
learners which achieve a small error on all tasks should be preferred to very
specific ones. A similar concept is used in the work of Torralba et al. (2007),
who propose learning multiple binary classifiers jointly with a Boosting
extension called JointBoost. The algorithm tries to find weak learners shared by
multiple categories. This also leads to a reduction of the computation time
needed to localize an object with a sliding-window approach. Experiments of
Torralba et al. (2007) show that the number of feature evaluations grows
logarithmically in the number of categories, which is an important benefit
compared to independent learning. An extension of this approach to kernel
learning can be found in Jiang et al. (2007). Salakhutdinov et al. (2011)
exploit category hierarchies and perform feature sharing by combining
hyperplane parameters.



Shared Latent Space Finding discriminative and meaningful features is a 
very difficult task. Therefore, the assumption of shared features between tasks 
is often not valid for empirically selected features used in the application.
Quattoni et al. (2007) propose assuming an underlying latent feature space which
is common to all tasks. They use the method of Ando and Zhang (2005) to
estimate a feature transformation from support tasks derived from caption texts
of news images. To estimate relevant features, the subsequent work (Quattoni
et al. 2008) proposes an eigenvalue analysis and the use of unlabeled data.

Latent feature spaces can be modeled in a Bayesian framework using Gaussian
processes, which leads to the so-called Gaussian process latent variable
model (GP-LVM) (Lawrence 2005). Incorporating the idea of a shared latent
space into the GP-LVM framework allows using various kinds of noise models
and kernels (Urtasun et al. 2008).



Constellation Model and Hierarchical Bayesian Learning An approach 
for knowledge transfer between visual object categories was presented by Fei-Fei
et al. (2006). Their method is inspired by the fact that a lot of categories share
common object parts which are often also in the same relative position. Based
on a generative constellation model (Fergus et al. 2003), they propose using
maximum a posteriori estimation to obtain model parameters for a new target
category. The prior distribution of the parameters corresponding to relative part
positions and part appearance is learned from support tasks. The underlying
idea can also be applied to a shape-based approach (Stark et al. 2009).

A prior on parameters shared between tasks is an instance of hierarchical
Bayesian learning. Raina et al. (2006) used this concept to transfer covariance
estimates of parts of the feature set.



Joint Regularization A lot of machine learning algorithms such as SVM are 
not directly formulated in a probabilistic manner but as optimization problems. 
These problems often include regularization terms connected to the complexity 
of the parameter, which would correspond to a prior distribution in a Bayesian 
setting. The equivalent to hierarchical Bayesian learning as described in the last 
paragraph is a joint regularization term shared between tasks. 

Amit et al. (2007) propose using trace norm regularization of the weight
matrix in a multi-class SVM approach. They show that this regularization is
related to the assumption of a shared feature transformation and task-specific
weight vectors with independent regularization terms. Instead of transferring
knowledge between different classification tasks, this work concentrates on
transfer learning in a multi-class setting, i.e. multi-class transfer.

Sharing a low dimensional data representation for multitask learning is the
idea of Argyriou et al. (2006). The proposed optimization problem learns a
feature transformation and a weight vector jointly and additionally favors sparse
solutions by utilizing a sparsity-inducing regularization term. Multitask
learning with kernel machines was first studied by Evgeniou et al. (2005). Their
idea is to reduce the multitask problem to a single task setting by defining a
combined kernel function or multitask kernel and a new regularizer.



Shared Prior on Latent Functions The framework of Gaussian processes 
allows modeling a prior distribution of an underlying latent function for each 



classifier ( Rasmussen and Williams 2005 ) using a kernel function. 



If we want to learn a set of classifiers jointly in a multitask setting, an appro- 
priate assumption is that all corresponding latent functions are sampled from 



the same prior distribution. Lawrence et al. (2004) suggest learning the hyper- 



parameters of the kernel function jointly by maximizing the marginal likelihood. 



This idea was also applied to image categorization tasks (Kapoor et al. 20101. 



A more powerful way of performing transfer learning is studied with multitask
kernels originally introduced by Evgeniou et al. (2005). Bonilla et al. (2007)
use a parameterized multitask kernel that is the product of a base kernel
comparing input features and a task kernel modeling the similarity between
tasks and using meta or task-specific features. Task similarities can also be
learned without additional meta features by estimating a non-parametric version
of the task kernel matrix (Bonilla et al. 2008). A theoretical study of the
generalization bounds induced by this framework can be found in Chai et al.
(2008). Schwaighofer et al. (2005) propose an algorithm and model to learn the
fully non-parametric form of the multitask kernel in a hierarchical Bayesian
framework.
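
The product structure of a multitask kernel described above can be sketched in a few lines. This is an illustration only, not the implementation of any of the cited papers: the squared-exponential base kernel, the function names and the hand-picked task similarity matrix are assumptions made here.

```python
import numpy as np

def rbf_kernel(X, Z, length_scale=1.0):
    """Base kernel comparing input features (squared-exponential)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def multitask_kernel(X, tasks_x, Z, tasks_z, task_cov, length_scale=1.0):
    """K((x, s), (z, t)) = task_cov[s, t] * k_base(x, z)."""
    task_part = task_cov[np.ix_(tasks_x, tasks_z)]
    return task_part * rbf_kernel(X, Z, length_scale)

# Toy data (assumption): six examples with two features, spread over 3 tasks.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
tasks = np.array([0, 0, 1, 1, 2, 2])

# Hypothetical task similarity matrix, built as A A^T so that it is
# positive semi-definite. Tasks 0 and 2 are modeled as unrelated.
A = np.array([[1.0, 0.0],
              [0.9, 0.3],
              [0.0, 1.0]])
task_cov = A @ A.T

K = multitask_kernel(X, tasks, X, tasks, task_cov)
```

The resulting matrix K can be used like any other kernel matrix, which is what reduces the multitask problem to a single-task one. Entries between examples of unrelated tasks are zero, so no information is shared between them.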



The semi-parametric latent factor model (SLFM) of Teh et al. (2005) is
directly related to a multitask kernel. The latent function for each task is
modeled as a linear combination of a smaller set of underlying latent functions.
Therefore, the full covariance matrix has a smaller rank, which directly
corresponds to the rank assumption of other transfer learning ideas (Amit et
al. 2007). A
more general framework which allows modeling arbitrary dependencies between
examples and tasks using a graph-theoretic notation is presented by Yu and
Chu (2008).
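
The rank argument for the latent factor construction above can be checked numerically. The following sketch makes simplifying assumptions that are not part of the original model description: the latent functions are replaced by random vectors, they are treated as independent with a common kernel, and all variable names are chosen here.

```python
import numpy as np

rng = np.random.default_rng(1)
num_tasks, num_latent, num_points = 5, 2, 50

# Stand-ins for the latent functions evaluated at the same inputs. In the
# model these would be Gaussian process draws; random vectors suffice here.
latent = rng.normal(size=(num_latent, num_points))

# Mixing weights: one row of coefficients per task.
mixing = rng.normal(size=(num_tasks, num_latent))

# Each task function is a linear combination of the latent functions.
task_functions = mixing @ latent  # shape (num_tasks, num_points)

# Task covariance implied by the mixing weights if the latent functions
# are independent and share one kernel (assumption of this sketch).
task_cov = mixing @ mixing.T

rank = np.linalg.matrix_rank(task_cov)
```

Although there are five tasks, the task covariance has rank at most two, the number of latent functions.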



Semantic Attributes and Similarities Transfer learning with very few ex- 
amples of the target task can be difficult due to the lack of data to estimate 
task relations and similarities correctly. Especially if no training data (neither 
labeled nor unlabeled) is available, other data sources have to be used to per- 
form transfer learning. This scenario is known as zero-shot learning and uses 
the concept of learning with attributes, an area which has received much
attention in recent years. The term attribute refers to category-specific
features.



Lampert et al. (2009) use a large database of human-labeled abstract
attributes of animal classes (e.g. brown, stripes, water, eats fish). One idea
is to train several attribute classifiers and use their output as new meta
features. This representation allows recognizing new categories without real
training images, solely by comparison with the attribute description of the
category. The knowledge transferred from support tasks is the powerful
discriminative attribute representation which was learned with all training
data. A similar idea is presented by Larochelle et al. (2008) for zero-shot
learning based on task-specific features. A theoretical investigation of
zero-shot learning with an attribute representation is given by Palatucci et
al. (2009) and concentrates on an analysis with the concept of probably
approximately correct (PAC) learning (Vapnik 2000).
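
The comparison of attribute classifier outputs with class descriptions can be illustrated with a small sketch. It is a simplified stand-in and not the exact procedure of any cited paper: attributes are treated as independent, attribute priors are ignored, and the class signatures and classifier outputs are made-up toy values.

```python
import numpy as np

# Hypothetical attribute signatures of classes without training images
# (columns: brown, stripes, water, eats fish).
class_names = ["zebra", "otter", "bear"]
class_attributes = np.array([
    [0, 1, 0, 0],  # zebra
    [1, 0, 1, 1],  # otter
    [1, 0, 0, 1],  # bear
], dtype=float)

def zero_shot_predict(attribute_probs, class_attributes):
    """Return the index of the best matching class signature per image.

    attribute_probs has shape (n_images, n_attributes) and holds the outputs
    of attribute classifiers, read as probabilities that an attribute is
    present in the image.
    """
    eps = 1e-9
    p = np.clip(attribute_probs, eps, 1.0 - eps)
    # Log-likelihood of each signature under independent attributes.
    log_lik = (np.log(p) @ class_attributes.T
               + np.log(1.0 - p) @ (1.0 - class_attributes).T)
    return np.argmax(log_lik, axis=1)

# Made-up classifier outputs for two test images.
attribute_probs = np.array([
    [0.1, 0.9, 0.2, 0.1],
    [0.8, 0.1, 0.9, 0.7],
])
indices = zero_shot_predict(attribute_probs, class_attributes)
predicted = [class_names[i] for i in indices]
```

The attribute classifiers themselves would be trained on the support classes. Only the signature table is needed for the new classes.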



Instead of relying on human-labeled attributes, internet sources can be used
to mine attributes and relations. The papers of Rohrbach et al. (2010a,b)
compare different kinds of linguistic sources, such as WordNet (Pedersen et al.
2004), Google search, Yahoo and Flickr. A large-scale evaluation of their
approach can be found in Rohrbach et al. (2011). Attribute-based recognition
can help to generalize to new tasks or categories (Farhadi et al. 2009), which
is otherwise difficult using a training set only equipped with ordinary category
labels. Attributes can also help to boost the performance of object detection
rather than image categorization, as shown by Farhadi et al. (2010). Their
transfer learning approach heavily relies on model sharing of object parts
between categories.



Lampert and Kromer (2010) use a generalization of maximum covariance
analysis to find a shared latent subspace of different data modalities. This can
also be applied to transfer learning with attributes by regarding the attribute
representation as a second modality. Beyond zero-shot learning, semantic
similarities are also used to guide regularization (Wang et al. 2010).



Context Information Up to now, we only covered transfer learning in which 
knowledge is used from visually similar object categories or tasks. However, 
dissimilar categories can also provide useful information if they can be used 
to derive contextual information. For example, it is likely to find a keyboard
next to a computer monitor, which can be valuable information for an object
detector. Methods using contextual information always exploit dependencies
between categories and tasks and are therefore a special case of knowledge
transfer approaches.



Fink and Perona (2003) propose training a set of object detectors
simultaneously with an extended version of the AdaBoost algorithm (Viola and
Jones 2004). Their approach can be regarded as an instance of multitask
learning. In each round of the boosting algorithm, the map of detection scores
is updated and used as an additional feature in subsequent rounds. A similar
idea is presented by Shotton et al. (2008) for semantic segmentation, which
labels each pixel of the image as one of the learned categories. The work of
Hoiem et al. (2005) pursues the same line of research, but clearly separates
the support and target tasks. In a first step, geometric properties of image
areas are estimated. The resulting labeling into planar, non-planar, and porous
objects, as well as ground and sky areas, can be used to further assist local
detectors as high-level features. Contextual relationships between different
categories can also be modeled directly with a conditional Markov random field
(CRF), as done by Rabinovich et al. (2007) and Galleguillos et al. (2008) on a
region-based level for semantic segmentation.

3.2 Heterogeneous Transfer: From Where to Transfer?



Automatically selecting appropriate support tasks from a large set is a difficult 
sub-task of transfer learning. Therefore, most of the previous work presented in 
this thesis so far assumes that support tasks are given in advance. An exception 
is the early work of Thrun and O'Sullivan (1996), which proposes the task
clustering algorithm. Similarities between two tasks are estimated by testing 
the classifier learned on one task using data from the other task. Afterward, 
clustering can be performed with the resulting task similarity matrix. 
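
The cross-task testing step can be sketched as follows. This is a rough illustration and not the original task clustering algorithm: a nearest-centroid classifier is swapped in for simplicity, the toy tasks are synthetic, and the greedy grouping with a threshold of 0.8 is a hypothetical stand-in for a proper clustering method.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class (0 and 1)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def accuracy(centroids, X, y):
    """Fraction of examples whose nearest centroid matches the label."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(np.argmin(dists, axis=1) == y))

def task_similarity(tasks):
    """Entry (i, j): accuracy on task j of the classifier trained on task i."""
    models = [fit_centroids(X, y) for X, y in tasks]
    return np.array([[accuracy(m, X, y) for X, y in tasks] for m in models])

rng = np.random.default_rng(0)

def make_task(axis, n=200):
    """Synthetic binary task whose label depends on one coordinate."""
    X = rng.normal(size=(n, 2))
    y = (X[:, axis] > 0).astype(int)
    return X, y

# Tasks 0 and 1 share a decision rule, task 2 uses a different one.
tasks = [make_task(0), make_task(0), make_task(1)]
S = task_similarity(tasks)
S_sym = 0.5 * (S + S.T)  # symmetrize before clustering

# Greedy grouping: a task joins the first cluster in which it is similar
# to every member, otherwise it starts a new cluster.
clusters = []
for t in range(len(tasks)):
    for cluster in clusters:
        if all(S_sym[t, u] > 0.8 for u in cluster):
            cluster.append(t)
            break
    else:
        clusters.append([t])
```

In this toy setting the first two tasks end up in one cluster, since a classifier trained on one of them transfers well to the other, while the third task remains on its own.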



Mierswa and Wurst (2005) select relevant features for a target task by
comparing the weights of the SVM hyperplane with each of the available tasks.
Therefore, the algorithm selects a similar but more robustly estimated feature
representation. The work of Kaski and Peltonen (2007) performs transfer
learning with logistic regression classifiers and models the likelihood of each
task as a mixture of the target task likelihood and a likelihood term which is
independent of all other tasks. Due to the task-dependent weight, the algorithm
can adapt to heterogeneous environments. In general, selecting support tasks is
a model selection problem; therefore, techniques like leave-one-out are used
(Tommasi and Caputo 2009; Tommasi et al. 2010). Heterogeneous tasks can also be
handled within the regularization framework of Argyriou et al. (2006) by
directly optimizing a clustering of the tasks (Argyriou et al. 2008).



4 Summary 

We gave a summary of current work in the area of visual transfer learning.
Although a large number of papers deal with the problem of transferring
information between tasks (or domains), many of the methods share the same
assumptions and underlying basic ideas.